
Use lazy descriptor pool allocation #2285

Closed
wants to merge 1 commit

Conversation

@SS-JIA (Contributor) commented Mar 6, 2024

Summary:

## Context

In Vulkan, memory for Descriptor Sets (which are used to bind data to shader arguments) must be pre-allocated. Previously, the convention was to allocate a large number of descriptor sets upon creation of a Vulkan Context. While this worked well in the Lite Interpreter, where only a single global Vulkan Context is used, it leads to overallocating descriptor sets in the Vulkan Delegate, where every `ComputeGraph` has its own dedicated Context.

pytorch/pytorch#121134 allows the Descriptor Set pool to be initialized in a deferred fashion. This means that a ComputeGraph can count the total number of descriptors needed across all the compute shaders that will be encoded, and then allocate a Descriptor Set Pool of the appropriate size.
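
For context on why deferral matters, the sketch below shows the raw Vulkan mechanism: a `VkDescriptorPool` has fixed per-type capacities set at creation time, so creating it only after the counts are known lets it be sized exactly. This is a minimal sketch against the plain Vulkan API, not the actual `api::Context` code; the `DescriptorCounts` struct is a hypothetical stand-in for whatever bookkeeping the graph uses.

```cpp
#include <cstdint>
#include <vulkan/vulkan.h>

// Hypothetical tally of the graph's descriptor requirements; the field
// names are illustrative, not the actual ExecuTorch API.
struct DescriptorCounts {
  uint32_t max_sets = 0;
  uint32_t uniform_buffers = 0;
  uint32_t storage_buffers = 0;
  uint32_t combined_samplers = 0;
  uint32_t storage_images = 0;
};

// A descriptor pool's capacity is fixed at creation time, which is why
// deferring vkCreateDescriptorPool until the counts are known avoids
// having to guess a large upfront size.
VkDescriptorPool create_sized_pool(VkDevice device, const DescriptorCounts& c) {
  const VkDescriptorPoolSize pool_sizes[] = {
      {VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER, c.uniform_buffers},
      {VK_DESCRIPTOR_TYPE_STORAGE_BUFFER, c.storage_buffers},
      {VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER, c.combined_samplers},
      {VK_DESCRIPTOR_TYPE_STORAGE_IMAGE, c.storage_images},
  };

  VkDescriptorPoolCreateInfo info{};
  info.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
  info.maxSets = c.max_sets;  // total number of sets this pool can hand out
  info.poolSizeCount = 4u;
  info.pPoolSizes = pool_sizes;

  VkDescriptorPool pool = VK_NULL_HANDLE;
  vkCreateDescriptorPool(device, &info, nullptr, &pool);
  return pool;
}
```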

## Implementation Overview

1. When constructing a `ComputeGraph`, make sure that the descriptor pool config contains 0 for the max number of sets. This ensures that no descriptor pool is initialized when constructing the graph's `api::Context` instance.
2. When building the graph, `ExecuteNode` and `PrepackNode` call `graph.update_descriptor_counts(shader)` upon construction, which allows `ComputeGraph` to count the total number of descriptor sets needed (see the sketch after this list).
3. There are separate descriptor count objects for prepack and execute, since they correspond to different command buffers.
4. Before encoding any command buffers, call `graph.prepare()`, which constructs a descriptor pool config from the descriptor counts.
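
As an illustration of step 2, the counting could look something like the sketch below, reusing the hypothetical `DescriptorCounts` struct from the previous sketch. The `ShaderLayout` alias and the function signature are assumptions for illustration; the real `update_descriptor_counts` in ExecuTorch may differ.

```cpp
#include <vector>
#include <vulkan/vulkan.h>

// Illustrative stand-in: a shader's descriptor set layout expressed as
// the descriptor type of each binding slot.
using ShaderLayout = std::vector<VkDescriptorType>;

// Called once per ExecuteNode/PrepackNode at construction time, so the
// graph knows the total pool size before any command buffer is encoded.
void update_descriptor_counts(DescriptorCounts& counts, const ShaderLayout& layout) {
  counts.max_sets += 1;  // one descriptor set per shader dispatch
  for (const VkDescriptorType type : layout) {
    switch (type) {
      case VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER:         counts.uniform_buffers++;   break;
      case VK_DESCRIPTOR_TYPE_STORAGE_BUFFER:         counts.storage_buffers++;   break;
      case VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER: counts.combined_samplers++; break;
      case VK_DESCRIPTOR_TYPE_STORAGE_IMAGE:          counts.storage_images++;    break;
      default: break;  // other descriptor types are not tallied here
    }
  }
}
```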

## Notes

One interesting finding is that I had to apply a safety factor to the descriptor counts to prevent the pool from running out of memory. This was reproducible on both Linux and Android.
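
For illustration, such a safety factor amounts to padding each tallied count by a multiplier along these lines; the factor of 1.25 here is an arbitrary placeholder, not the value used in this changeset:

```cpp
#include <cstdint>

// Hypothetical safety factor; the actual value in the changeset may differ.
constexpr float kSafetyFactor = 1.25f;

uint32_t padded_count(uint32_t tallied) {
  // Round up so a nonzero tally always gains at least some headroom.
  return static_cast<uint32_t>(tallied * kSafetyFactor) + 1u;
}
```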

A more robust design, as discussed [here](https://www.reddit.com/r/vulkan/comments/17v66fi/question_about_descriptor_pool_allocations/), may be to maintain a separate descriptor pool for each layout type. We should revisit this refactor at a later time.
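
As a rough sketch of that alternative (drawn from the linked discussion, not from this PR's code), each `VkDescriptorSetLayout` would get its own dedicated pool, so sets of one shape can never starve sets of another:

```cpp
#include <unordered_map>
#include <vulkan/vulkan.h>

// Hypothetical: one descriptor pool per set layout. Sizing is exact
// because every set allocated from a given pool has the same shape, and
// exhausting one layout's pool cannot affect any other layout.
class PerLayoutPools {
 public:
  explicit PerLayoutPools(VkDevice device) : device_(device) {}

  VkDescriptorPool pool_for(
      VkDescriptorSetLayout layout,
      const VkDescriptorPoolCreateInfo& config) {
    auto it = pools_.find(layout);
    if (it == pools_.end()) {
      // Lazily create the pool for this layout on first use.
      VkDescriptorPool pool = VK_NULL_HANDLE;
      vkCreateDescriptorPool(device_, &config, nullptr, &pool);
      it = pools_.emplace(layout, pool).first;
    }
    return it->second;
  }

 private:
  VkDevice device_;
  std::unordered_map<VkDescriptorSetLayout, VkDescriptorPool> pools_;
};
```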

Differential Revision: D54603935


pytorch-bot bot commented Mar 6, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/2285

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit 1c16288 with merge base 0570294:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 6, 2024
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D54603935

SS-JIA added a commit to SS-JIA/executorch-1 that referenced this pull request Mar 7, 2024

@facebook-github-bot

This pull request has been merged in aed32c4.

facebook-github-bot pushed a commit that referenced this pull request Mar 13, 2024
Summary:
Pull Request resolved: #2398

## Context

While testing a toy model with a large number of operators, I ran into an issue on my local Pixel 6 Android device where the descriptor pool was running out of memory. This changeset implements a simple fix to ensure that descriptor pools do not run into this issue.

A longer-term solution is to implement layout-specific descriptor pools, but that is much more technically complex, so we go with this for now.

## Problem Details

#2285 made it so that `ComputeGraph` could tally up the total number of descriptors needed and size the descriptor pools appropriately, but it seems that this is not compatible with certain Vulkan drivers.

The toy model contains 1000 binary operators. Counting the descriptors required for the graph yields the following counts:

```
descriptorPoolMaxSets: 1255
descriptorUniformBufferCount: 5013
descriptorStorageBufferCount: 4
descriptorCombinedSamplerCount: 2504
descriptorStorageImageCount: 1254
```

These counts appear to be correct; however, the descriptor pool still runs out of memory due to an insufficient `descriptorStorageBufferCount`. The `descriptorStorageBufferCount` has to be set to a surprisingly high number (roughly 1000) before the descriptor pool stops running out of memory. I'm not sure exactly what causes this behaviour, but it could be due to implementation details of the driver.

## Solution

Ensuring that all descriptor counts are greater than or equal to the maximum number of descriptor sets seems to work, so this is implemented as a temporary solution; a minimal sketch of the idea follows.
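
A minimal sketch of the workaround, reusing the hypothetical `DescriptorCounts` struct from the sketches earlier in this thread (the actual fix lives in the ExecuTorch source):

```cpp
#include <algorithm>
#include <cstdint>

// Temporary workaround: some drivers exhaust the pool even when the
// tallied per-type counts look sufficient, so clamp every per-type
// count to at least the number of descriptor sets.
void apply_count_floor(DescriptorCounts& counts) {
  const uint32_t min_count = counts.max_sets;
  counts.uniform_buffers   = std::max(counts.uniform_buffers, min_count);
  counts.storage_buffers   = std::max(counts.storage_buffers, min_count);
  counts.combined_samplers = std::max(counts.combined_samplers, min_count);
  counts.storage_images    = std::max(counts.storage_images, min_count);
}
```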
ghstack-source-id: 218509680

bypass-github-export-checks
bypass-github-pytorch-ci-checks
bypass-github-executorch-ci-checks

Reviewed By: jorgep31415

Differential Revision: D54853788

fbshipit-source-id: 391e3d10a678672df9af96e3b6a8484453b039f1